How do I properly size CPU, memory, and GPU for my AI workloads?
Objective
The goal of this document is to help you choose appropriate CPU, memory, and GPU resources for running common AI workloads.
Finding the right amount of CPU, memory, and GPU for a given AI workload can be challenging.
Precision, batch size, model size, and context length are all tightly coupled to how many resources (especially GPU memory) a workload needs.
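As a rough starting point, GPU memory can be estimated from the parameter count and numeric precision. The sketch below is a rule of thumb only, not an exact calculation: the `fits` helper and the 10% headroom figure are assumptions, and real usage grows with batch size and context length (KV cache).

```python
# Rule-of-thumb VRAM estimate: weights = parameter count x bytes per parameter,
# plus headroom for the KV cache, activations, and framework buffers.
# The 10% headroom is an assumption; large batches or long contexts need more.

def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """GB needed just to hold the model weights (1B params at 2 bytes ~= 2 GB)."""
    return params_billion * bytes_per_param

def fits(params_billion: float, bytes_per_param: float,
         vram_per_gpu_gb: float, num_gpus: int, headroom: float = 0.10) -> bool:
    """True if the weights plus headroom fit in the combined VRAM of the GPUs."""
    needed = weights_gb(params_billion, bytes_per_param) * (1 + headroom)
    return needed <= vram_per_gpu_gb * num_gpus

# Llama-7B in FP16: ~14 GB of weights -> a single 16 GB+ GPU is enough
print(fits(7, 2.0, vram_per_gpu_gb=16, num_gpus=1))   # True

# Llama-70B in FP16: ~140 GB of weights -> needs 2x H100 (80 GB each)
print(fits(70, 2.0, vram_per_gpu_gb=80, num_gpus=1))  # False
print(fits(70, 2.0, vram_per_gpu_gb=80, num_gpus=2))  # True

# Llama-70B quantized to 8-bit: ~70 GB -> fits 1x H100 80 GB at small batch sizes
print(fits(70, 1.0, vram_per_gpu_gb=80, num_gpus=1))  # True
```

These estimates line up with the recommended values in the table below; treat them as a sanity check, not a guarantee.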
Recommended Values:
AI Workload | CPU Cores | Memory (GB) | GPU Count | VRAM per GPU (GB) | Notes |
---|---|---|---|---|---|
Llama-7B | 8–16 | 32–64 | 1 | 16+ | Single GPU sufficient; fits L40, L40S, H100, H200 |
Llama-70B (FP16) | 16–32 | 128–256 | 2 (H100) | 80 | Or 1× H200 (141 GB), 3× L40 (48 GB each) |
Llama-70B (Quantized 8-bit) | 16–32 | 128–256 | 1 (H100) | 80 | Or 2× L40 (48 GB each), depends on batch size |
vLLM – Inference Server | 16–32 | 64–128 | Model-dependent | Model-dependent | See model requirements; e.g., 2× H100 for 70B (see the sketch after this table) |
NVIDIA NIM | 16–32 | 64–128 | Model-dependent | Model-dependent | See model requirements; e.g., 2× H100 for 70B |
Infinity Server (Embeddings) | 8–16 | 32–64 | 1 | 8–16 | Fits L40, L40S, H100, H200; often overprovisioned |
Invoke (Image Generation) | 8–16 | 32–64 | 1 | 8+ | Preferably 16 GB; fits all specified GPUs |
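To make the vLLM row concrete, here is a minimal sketch of how the GPU count maps to a tensor-parallel deployment. The model name, memory-utilization value, and prompt are placeholders; adjust them to your environment.

```python
from vllm import LLM, SamplingParams

# Sketch only: serving a 70B model across 2 GPUs (matching the 2x H100 row above).
# The model ID and gpu_memory_utilization value are example placeholders.
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # placeholder model ID
    tensor_parallel_size=2,                  # one weight shard per GPU -> 2x H100
    dtype="float16",                         # FP16 weights, as in the table
    gpu_memory_utilization=0.90,             # leave ~10% VRAM headroom
)

outputs = llm.generate(["Summarize why GPU memory sizing matters."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```

For the quantized 70B row, a single-GPU deployment would drop `tensor_parallel_size` to 1 and load an appropriately quantized checkpoint.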